Improving Statistical Machine Translation Performance by Training Data Selection and Optimization

Authors

  • Yajuan Lü
  • Jin Huang
  • Qun Liu
Abstract

A parallel corpus is an indispensable resource for translation model training in statistical machine translation (SMT). Instead of collecting ever more parallel training corpora, this paper aims to improve SMT performance by exploiting the full potential of existing parallel corpora. Two kinds of methods are proposed: offline data optimization and online model optimization. The offline method adapts the training data by redistributing the weight of each training sentence pair. The online method adapts the translation model by redistributing the weight of each predefined submodel. An information retrieval model is used as the weighting scheme in both methods. Experimental results show that, without using any additional resources, both methods can improve SMT performance significantly.
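
The offline reweighting idea can be sketched roughly as follows. The abstract does not say which retrieval model is used or how retrieval scores become sentence weights, so everything below is an assumption made for illustration: a simple TF-IDF cosine ranking, a hypothetical function `reweight_training_pairs`, a hypothetical `top_n` cutoff, and a rule that adds 1 to a training pair's weight each time its source side is retrieved for a test sentence.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF; this particular formula is an assumption of the sketch.
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vecs, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def reweight_training_pairs(train_src, test_src, top_n=2):
    """Return one integer weight per training pair (hypothetical scheme).

    Every pair starts at weight 1; a pair gains 1 each time its source
    sentence is among the top_n matches for some test sentence.
    """
    train_vecs, idf = tfidf_vectors([s.lower().split() for s in train_src])
    weights = [1] * len(train_src)
    for query in test_src:
        counts = Counter(query.lower().split())
        # Terms unseen in the training data get zero weight in the query.
        q = {t: tf * idf.get(t, 0.0) for t, tf in counts.items()}
        scores = [cosine(q, vec) for vec in train_vecs]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        for i in ranked[:top_n]:
            if scores[i] > 0.0:
                weights[i] += 1
    return weights
```

In a pipeline of this kind, the resulting weights could be used to duplicate or scale the counts of the corresponding sentence pairs before translation model training; that step is likewise an assumption and is not described in the abstract.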

Similar resources

Submodularity for Data Selection in Machine Translation

We introduce submodular optimization to the problem of training data subset selection for statistical machine translation (SMT). By explicitly formulating data selection as a submodular program, we obtain fast scalable selection algorithms with mathematical performance guarantees, resulting in a unified framework that clarifies existing approaches and also makes both new and many previous appro...


Submodularity for Data Selection in Statistical Machine Translation

We introduce submodular optimization to the problem of training data subset selection for statistical machine translation (SMT). By explicitly formulating data selection as a submodular program, we obtain fast scalable selection algorithms with mathematical performance guarantees, resulting in a unified framework that clarifies existing approaches and also makes both new and many previous appro...


Considerations in Maximum Mutual Information and Minimum Classification Error Training for Statistical Machine Translation

Discriminative training methods are used in statistical machine translation to effectively introduce and combine additional knowledge sources within the translation process. Although these methods are described in the accompanying literature and comparative studies are available for speech recognition, additional considerations are introduced when applying discriminative training to statistical...


PhD Defense Presentation, 2219 Engineering Building: "Data Analysis and Selection for Statistical Machine Translation"

Statistical machine translation has received significant attention from the academic community over the past decade. This research has led to significant improvements in machine translation quality. As a result, it is widely adopted in industry (Google, Microsoft, Twitter, Facebook, etc.) as well as in government (http://nist.gov). The biggest factor in this improvement has been the av...


Low Cost Portability for Statistical Machine Translation based on N-gram Coverage

Statistical machine translation relies heavily on the available training data. However, in some cases, it is necessary to limit the amount of training data that can be created for or actually used by the systems. To solve that problem, we introduce a weighting scheme that tries to select more informative sentences first. This selection is based on the previously unseen n-grams the sentences con...




Publication date: 2007